Computation of the distribution of model accuracy statistics in machine learning: Comparison between analytically derived distributions and simulation‐based methods

Abstract

Background and Aims: The use of machine-learning techniques has increased across fields. To accurately evaluate the efficacy of novel modeling methods, it is necessary to critically evaluate the model metrics used, such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). For commonly used model metrics, we proposed the use of analytically derived distributions (ADDs) and compared them with simulation-based approaches.

Methods: A retrospective cohort study was conducted using the England National Health Services Heart Disease Prediction Cohort. Four machine learning models (XGBoost, Random Forest, Artificial Neural Network, and Adaptive Boost) were used. The distributions of the model metrics and covariate gain statistics were empirically derived using bootstrap simulation (N = 10,000). The ADDs were created from analytic formulas from the covariates to describe the distribution of the model metrics and were compared with the distributions obtained from bootstrap simulation.

Results: XGBoost was the best-performing model, with the highest AUROC and the highest aggregate score across six other model metrics. Based on the Anderson-Darling test, the distributions of the model metrics created from bootstrap did not significantly deviate from a normal distribution. The ADD yielded smaller SDs than those derived from bootstrap simulation, whereas the rest of the distribution was not statistically significantly different.

Conclusions: ADD allows for cross-study comparison of model metrics, which is usually done with bootstrapping, an approach that relies on simulations that cannot be replicated by the reader.


| Dependent variable
The dependent variable of interest was a clinician's diagnosis of heart disease.

| Model construction and statistical analysis
Descriptive statistics for all patients, and for patients stratified by heart disease, were computed for all covariates and compared using χ² tests for categorical variables and t tests for continuous variables.
Machine learning methods including XGBoost, Random Forest, Artificial Neural Network, and Adaptive Boosting were implemented on the data set.

| Distribution evaluation
The distribution of each of the statistics was evaluated through comparison of summary statistics (minimum, 5th percentile, 25th percentile, 50th percentile, 75th percentile, 95th percentile, maximum, mean, SD) and the Anderson-Darling test for normality.
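The evaluation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's own code: the function names `summarize` and `anderson_darling_normal` are hypothetical, the percentile interpolation rule is an assumption, and the small-sample adjustment applied to the Anderson-Darling statistic is the one commonly used when the mean and SD are estimated from the data.

```python
import math
import random
import statistics
from statistics import NormalDist


def summarize(values):
    """Return the summary statistics listed above for one model metric."""
    xs = sorted(values)
    # quantiles(n=100) returns the 1st to 99th percentile cut points.
    pct = statistics.quantiles(xs, n=100, method="inclusive")
    return {
        "min": xs[0],
        "p5": pct[4],
        "p25": pct[24],
        "p50": pct[49],
        "p75": pct[74],
        "p95": pct[94],
        "max": xs[-1],
        "mean": statistics.fmean(xs),
        "sd": statistics.stdev(xs),
    }


def anderson_darling_normal(values):
    """Adjusted Anderson-Darling A^2 against a normal with estimated mean and SD."""
    xs = sorted(values)
    n = len(xs)
    dist = NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    eps = 1e-12  # keep the CDF away from 0 and 1 so the logs stay finite
    cdf = [min(max(dist.cdf(x), eps), 1 - eps) for x in xs]
    s = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    # Small-sample adjustment for the case where mean and SD are estimated.
    return a2 * (1 + 0.75 / n + 2.25 / n**2)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.85, 0.02) for _ in range(1000)]  # synthetic metric values
    print(summarize(sample))
    print(anderson_darling_normal(sample))
```

Larger values of the adjusted statistic indicate stronger departure from normality; a commonly tabulated 5% critical value for this adjusted form is about 0.752, which should be checked against the reference used in a given analysis.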

| Bootstrap simulation
A 70:30 train-test split was used for all machine-learning models in this study. Bootstrap simulations (N = 10,000) were carried out by permuting the train-test sets before training.
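One way to read this procedure is as repeated random 70:30 splitting, with the model refit and scored on each split. The sketch below illustrates that reading under stated assumptions: the helper names (`split_70_30`, `resample_metric`, `fit_threshold`, `accuracy`) are hypothetical, and a one-variable threshold classifier on synthetic data stands in for the four models used in the study.

```python
import random
import statistics


def split_70_30(n_rows, rng):
    """Shuffle row indices and cut them into 70% train and 30% test."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(round(0.7 * n_rows))
    return idx[:cut], idx[cut:]


def resample_metric(rows, fit, score, n_rep=1000, seed=0):
    """Refit on a fresh random split each repetition and collect the test metric."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_rep):
        train_idx, test_idx = split_70_30(len(rows), rng)
        model = fit([rows[i] for i in train_idx])
        out.append(score(model, [rows[i] for i in test_idx]))
    return out


def fit_threshold(train):
    """Toy stand-in model: midpoint between the two class means of x."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.fmean(pos) + statistics.fmean(neg)) / 2


def accuracy(threshold, test):
    """Fraction of test rows classified correctly by the threshold."""
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Synthetic rows (x, y): 100 per class, class means 0 and 1, unit SD.
    rows = [(data_rng.gauss(float(y), 1.0), y) for y in [0, 1] * 100]
    accs = resample_metric(rows, fit_threshold, accuracy, n_rep=1000)
    print(statistics.fmean(accs), statistics.stdev(accs))
```

The list returned by `resample_metric` is the empirical distribution of the metric; its mean and SD are the quantities that an analytically derived distribution would be compared against.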

| AUROC
The AUROC is given by $\mathrm{AUROC} = U/n^{2}$, where $U$ follows the Mann-Whitney distribution. As the Mann-Whitney distribution converges asymptotically to the Gaussian distribution at large sample sizes, the mean and SD are sufficient statistics.
We further observe that for large n, the variance formula for the AUROC can be approximated as: $\sigma^{2}_{\mathrm{AUROC}} = \frac{n^{2}(2n + 1)}{12\,n^{4}} = \frac{2n + 1}{12\,n^{2}} \rightarrow \frac{1}{6n}$
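The relationship between the AUROC and the Mann-Whitney statistic can be illustrated with a short sketch. It relies on the standard results that the AUROC equals U divided by the number of positive-negative pairs, and that, under the null hypothesis of no discrimination and without ties, U has variance n1·n2·(n1 + n2 + 1)/12. The function names are hypothetical, and the null-hypothesis SD is offered as an assumption about which variance is intended, not as the study's exact formula.

```python
import math


def mann_whitney_u(pos_scores, neg_scores):
    """Count (positive, negative) pairs ranked correctly; ties count one half."""
    u = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                u += 1.0
            elif p == q:
                u += 0.5
    return u


def auroc(pos_scores, neg_scores):
    """AUROC as U divided by the number of positive-negative pairs."""
    n_pairs = len(pos_scores) * len(neg_scores)
    return mann_whitney_u(pos_scores, neg_scores) / n_pairs


def auroc_null_sd(n_pos, n_neg):
    """SD of the AUROC under the null of no discrimination (no ties assumed)."""
    return math.sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))


if __name__ == "__main__":
    pos = [0.9, 0.8, 0.4]  # model scores for patients with heart disease
    neg = [0.7, 0.3, 0.2]  # model scores for patients without
    print(auroc(pos, neg))
    # With equal group sizes n, the SD approaches sqrt(1 / (6 n)) as n grows.
    print(auroc_null_sd(100, 100), math.sqrt(1 / (6 * 100)))
```

The pairwise loop is quadratic in the sample size and is written for clarity; a rank-based computation gives the same U more efficiently on large test sets.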

| DISCUSSION
We observed that the model metrics and model feature importance statistics for machine learning models converged on a Gaussian distribution in this retrospective, cross-sectional cohort of heart disease patients. ADDs were used to calculate sufficient statistics of the Gaussian distribution, namely the mean and SD. For both model metrics and feature importance statistics, the overall distribution given by the Gaussian approximation did not differ significantly from the distribution obtained by bootstrap simulation.
Bootstrapping has previously been the primary method used to derive the distributions of accuracy statistics for machine learning models. 18,20,24,25 Bootstrapping can generate a distribution from the data without any knowledge of the underlying distribution and without violating the assumptions required to use a distribution for inference. 16 The study's findings can be broadly applied to research on machine learning. To begin, researchers can be encouraged to employ a variety of machine-learning techniques and choose the most effective one, rather than relying solely on a single point estimate.
Instead, a thorough evaluation of the estimate variances for the model metrics can be used to accurately determine which model is the most effective. As a result, we argue that the strongest model is the one that has not only the highest AUROC point estimate on a randomly selected seed but also the most favorable distributions of multiple model accuracy statistics. 16,17,24,25,37-39 Furthermore, the results of this study support that the distribution of each model metric follows a normal distribution and can be modeled analytically through the Gaussian distribution, and through the Mann-Whitney distribution for the AUROC, which we have termed the ADD (pronounced the "AD distribution").

FIGURE 2 For the XGBoost models, the distribution of the gain statistic for all covariates: Age, Angina, Cholesterol, Fasting blood sugar (Fasting BS), Maximum heart rate (MaxHR), Resting blood pressure (RestingBP), Resting electrocardiogram (RestingECG), and Sex.